Exercise 1

Load the dataset CensusTweets.Rdata into your R session and get an overview of the variables it contains. What do the variables quote_count and reply_count describe? Why do we have missing data in them?

To load the data, you can use the load() function. To get an overview of the contained variables, you can simply use colnames(). To find out what the variables mean, look up the Twitter data dictionary online, which describes how a tweet object is structured.

# Loading dataset
load("../data/CensusTweets.Rdata")

# overview of columns
colnames(Census_tweets)

From the Twitter API documentation:

quote_count: Integer, Nullable. Indicates approximately how many times this Tweet has been quoted by Twitter users. Example: "quote_count":33. Note: This object is only available with the Premium and Enterprise tier products.

reply_count: Integer. Number of times this Tweet has been replied to. Example: "reply_count":30. Note: This object is only available with the Premium and Enterprise tier products.
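The note in the documentation suggests an answer to the second question: if the tweets were collected through the standard API tier (an assumption about how this dataset was gathered), these two fields would never have been delivered and would show up as NA. A minimal sketch of how one might check this, using a made-up stand-in data frame called tweets_demo rather than the real Census_tweets:

```r
# Toy stand-in for Census_tweets; the values are made up for illustration
tweets_demo <- data.frame(
  status_id   = c("1", "2", "3"),
  quote_count = c(NA_integer_, NA_integer_, NA_integer_),
  reply_count = c(NA_integer_, NA_integer_, NA_integer_),
  stringsAsFactors = FALSE
)

# Structure of the data: column names, classes, and first values
str(tweets_demo)

# Number of missing values per column
na_counts <- colSums(is.na(tweets_demo))
na_counts
```

Running the same two calls on Census_tweets would show whether quote_count and reply_count are missing in every row or only in some.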

Exercise 2

We only want to keep the variables screen_name, user_id, status_id, created_at, text and source. Create a new dataframe called Selection containing only these variables.

You can use the subset() function to keep only the variables you are interested in.

# selecting variables
Selection <- subset(Census_tweets, select = c("screen_name",
                                              "user_id",
                                              "status_id",
                                              "created_at",
                                              "text",
                                              "source"))

# checking selection
head(Selection,5)
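As an alternative to subset(), the same selection can be written with bracket indexing, which makes it easy to check the column names first. This is only a sketch on a made-up data frame (tweets_demo, with a hypothetical extra column lang), not the real dataset:

```r
# Toy stand-in for Census_tweets with one extra column to drop
tweets_demo <- data.frame(
  screen_name = c("a", "b"),
  user_id     = c("10", "11"),
  status_id   = c("100", "101"),
  created_at  = c("2020-03-12 14:05:00", "2020-03-12 15:35:00"),
  text        = c("first tweet", "second tweet"),
  source      = c("Twitter Web App", "Twitter for iPhone"),
  lang        = c("en", "en"),
  stringsAsFactors = FALSE
)

# Columns to keep
keep <- c("screen_name", "user_id", "status_id", "created_at", "text", "source")

# Guard against misspelled column names before subsetting
stopifnot(all(keep %in% colnames(tweets_demo)))

# Select the columns by name
selection_demo <- tweets_demo[, keep]
colnames(selection_demo)
```

Storing the names in a vector such as keep also lets you reuse the same selection on other data frames.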

Exercise 3

Check the class of the variable created_at in your new dataframe. Is this class suitable for further analysis? If not, change the class to the appropriate one and compute the time difference between the tweet in the first row and the tweet in the last row.

To check the class of the created_at variable, you can use the class() function. To check the formatting of the tweet timestamp, you can consult the Twitter documentation. To transform character strings into date-time objects in R, you can use the base function as.POSIXct() or the more convenient anytime() function from the package of the same name.

# Checking class
class(Selection$created_at)

# transforming to date-time object
library(anytime)
Selection$created_at <- anytime(Selection$created_at, asUTC = TRUE)
class(Selection$created_at)

# computing time difference
Selection$created_at[1] - Selection$created_at[nrow(Selection)]
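The base R route mentioned above could look roughly like the following. The timestamps are invented, and their "YYYY-MM-DD HH:MM:SS" layout is an assumption; the format string would need adjusting to match whatever created_at actually contains. Using difftime() instead of a plain subtraction lets you fix the unit of the result.

```r
# Made-up timestamps as character strings (assumed format, not the real data)
created_chr <- c("2020-03-12 14:05:00",
                 "2020-03-12 15:35:00",
                 "2020-03-13 09:05:00")

# Convert to POSIXct, stating the format and time zone explicitly
created_dt <- as.POSIXct(created_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(created_dt)

# Difference between the first and the last element, in hours
time_span <- difftime(created_dt[1], created_dt[length(created_dt)],
                      units = "hours")
time_span
```

The result is negative whenever the first timestamp is earlier than the last one; swap the two arguments if you prefer a positive span.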

Exercise 4

Check the tweet text variable of the tweet in the sixth row of your Selection dataframe. Is it ready for text analysis, or does it still contain things that need to be removed? If so, remove them from the whole column and save the results in a new variable called textClean.

You can use several functions from the qdapRegex package to remove hyperlinks and hashtags from the tweet text. You can search for the appropriate functions in the package documentation.

# loading package
library(qdapRegex)

# removing hyperlinks
textClean <- rm_url(Selection$text)

# removing hashtags
textClean <- rm_hash(textClean)

# checking results
View(textClean)
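For comparison, a rough base R version of the same cleaning step can be written with gsub(). The regular expressions below are simplified assumptions and are not the patterns qdapRegex uses internally, so results may differ on unusual tweets; text_demo is invented example text.

```r
# Made-up example tweets
text_demo <- c(
  "Fill in your #Census form today https://example.com/census",
  "No links or tags here"
)

# Remove hyperlinks: http:// or https:// followed by non-space characters
no_urls <- gsub("https?://\\S+", "", text_demo)

# Remove hashtags: "#" followed by letters, digits, or underscores
no_tags <- gsub("#\\w+", "", no_urls)

# Collapse repeated whitespace and trim both ends
text_clean_demo <- trimws(gsub("\\s+", " ", no_tags))
text_clean_demo
```

Printing a single element, for example text_clean_demo[1], is a quick way to inspect one cleaned tweet without opening the viewer.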